Can High Bandwidth and Latency Justify Large Cache Blocks

Authors

Ricardo Bianchini

Thomas J. LeBlanc

Abstract

An important architectural design decision affecting the performance of coherent caches in shared-memory multiprocessors is the choice of block size. There are two primary factors that influence this choice: the reference behavior of application programs and the remote access bandwidth and latency of the machine. Several studies have shown that increasing the block size can lower the miss rate and reduce the number of invalidations. However, increasing the block size can also increase the miss rate by, for example, increasing false sharing or the number of cache evictions. Large cache blocks can also generate network contention. Given that we anticipate enormous increases in both network bandwidth and latency in large-scale, shared-memory multiprocessors, the question arises as to what effect these increases will have on the choice of block size. We use analytical modeling and execution-driven simulation of parallel programs on a large-scale shared-memory machine to examine the relationship between cache block size and application performance as a function of remote access bandwidth and latency. We show that even under assumptions of high remote access bandwidth, the best application performance usually results from using cache blocks between 32 and 128 bytes in size. Using even larger blocks tends to increase the mean cost per reference, either because the miss rate increases or because the improvement in the miss rate is not enough to offset the increase in the miss penalty associated with larger blocks. We also show that modifying the program to remove the dominant source of misses may not help; the modified program could have a lower overall miss rate, but perform best with even smaller cache blocks.
Since there are many factors that limit improvements in the miss rate with an increase in block size, and since the remote access bandwidth and latency limit the extent to which an improvement in the miss rate results in a lower mean cost per reference, we conclude that large cache blocks cannot be justified in most realistic scenarios.
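
The trade-off described in the abstract can be illustrated with a small sketch. It is not the authors' analytical model, which this page does not reproduce. It assumes the common textbook form in which the miss penalty is a fixed remote latency plus the block transfer time, and the mean cost per reference is the hit time plus the miss rate times that penalty. The function names, the parameter values, and the miss rates in `ILLUSTRATIVE_MISS_RATES` are all hypothetical and chosen only for illustration; they are not results from the paper.

```python
# Minimal sketch of a mean-cost-per-reference model (an assumed textbook
# form, NOT the model used in the paper).

# Hypothetical miss rates per block size in bytes. These numbers are made up:
# the miss rate first falls with block size, then rises again (as it might
# with false sharing or extra evictions).
ILLUSTRATIVE_MISS_RATES = {
    16: 0.040,
    32: 0.025,
    64: 0.018,
    128: 0.015,
    256: 0.014,
    512: 0.016,
}


def miss_penalty(block_bytes, latency_cycles, bandwidth_bytes_per_cycle):
    """Assumed miss penalty: fixed remote latency plus block transfer time."""
    return latency_cycles + block_bytes / bandwidth_bytes_per_cycle


def mean_cost_per_reference(miss_rate, block_bytes, latency_cycles,
                            bandwidth_bytes_per_cycle, hit_cycles=1.0):
    """Assumed mean cost: hit time plus miss rate times miss penalty."""
    penalty = miss_penalty(block_bytes, latency_cycles,
                           bandwidth_bytes_per_cycle)
    return hit_cycles + miss_rate * penalty


def best_block_size(miss_rates, latency_cycles, bandwidth_bytes_per_cycle):
    """Return the block size with the lowest mean cost per reference."""
    return min(
        miss_rates,
        key=lambda size: mean_cost_per_reference(
            miss_rates[size], size, latency_cycles, bandwidth_bytes_per_cycle
        ),
    )


if __name__ == "__main__":
    # Hypothetical machine: 100-cycle remote latency, two bandwidth settings.
    for bandwidth in (1.0, 4.0):
        size = best_block_size(ILLUSTRATIVE_MISS_RATES, 100.0, bandwidth)
        print(f"bandwidth={bandwidth} B/cycle -> best block size {size} bytes")
```

Under these made-up inputs, raising the bandwidth from 1 to 4 bytes per cycle moves the minimum from 64 to 128 bytes, but no further: the 256-byte block has a slightly lower miss rate, yet its larger penalty outweighs the gain. This mirrors the abstract's argument in miniature, under the stated assumptions only.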


Similar resources

A Preliminary Evaluation of Cache-miss-initiated Prefetching Techniques in Scalable Multiprocessors

Prefetching is an important technique for reducing the average latency of memory accesses in scalable cache-coherent multiprocessors. Aggressive prefetching can significantly reduce the number of cache misses, but may introduce bursty network and memory traffic, and increase data sharing and cache pollution. Given that we anticipate enormous increases in both network bandwidth and latency, we exam...


Non-Referenced Prefetch (NRP) Cache for Instruction Prefetching

A new conceptual cache, NRP (Non-Referenced Prefetch) cache, is proposed to improve the performance of instruction prefetch mechanisms which try to prefetch both the sequential and non-sequential blocks under the limited memory bandwidth. The NRP cache is used in storing prefetched blocks which were not referenced by the CPU, while these blocks were discarded in other previous prefetch mechanis...


One-Level Cache Memory Design for Scalable SMT Architectures

The cache hierarchy design in existing SMT and superscalar processors is optimized for latency, but not for bandwidth. The size of the L1 data cache did not scale over the past decade. Instead, larger unified L2 and L3 caches were introduced. This cache hierarchy has a high overhead due to the principle of containment, as all the cache blocks in the upper level caches are contained in the lower...


Hardware Techniques to Improve the Performance of the

Technology trends are making communication, both on and off the microprocessor chip, more expensive relative to computation. In this dissertation, it is shown how a current-generation microprocessor spends over two-thirds of its time performing no useful work, stalled for memory. For the aggressive, modern processors that were measured, over half of the stalls due to memory result from insuffic...
